15 research outputs found

    Efficient Management of Cache Accesses to Boost GPGPU Memory Subsystem Performance

    Full text link
    "© 2019 IEEE. Personal use of this material is permitted. Permissíon from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertisíng or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works."[EN] To support the massive amount of memory accesses that GPGPU applications generate, GPU memory hierarchies are becoming more and more complex, and the Last Level Cache (LLC) size considerably increases each GPU generation. This paper shows that counter-intuitively, enlarging the LLC brings marginal performance gains in most applications. In other words, increasing the LLC size does not scale neither in performance nor energy consumption. We examine how LLC misses are managed in typical GPUs, and we find that in most cases the way LLC misses are managed are precisely the main performance limiter. This paper proposes a novel approach that addresses this shortcoming by leveraging a tiny additional Fetch and Replacement Cache-like structure (FRC) that stores control and coherence information of the incoming blocks until they are fetched from main memory. Then, the fetched blocks are swapped with the victim blocks (i.e., selected to be replaced) in the LLC, and the eviction of such victim blocks is performed from the FRC. This approach improves performance due to three main reasons: i) the lifetime of blocks being replaced is enlarged, ii) the main memory path is unclogged on long bursts of LLC misses, and iii) the average LLC miss latency is reduced. The proposal improves the LLC hit ratio, memory-level parallelism, and reduces the miss latency compared to much larger conventional caches. Moreover, this is achieved with reduced energy consumption and with much less area requirements. Experimental results show that the proposed FRC cache scales in performance with the number of GPU compute units and the LLC size, since, depending on the FRC size, performance improves ranging from 30 to 67 percent for a modern baseline GPU card, and from 32 to 118 percent for a larger GPU. In addition, energy consumption is reduced on average from 49 to 57 percent for the larger GPU. These benefits come with a small area increase (by 7.3 percent) over the LLC baseline.This work has been supported by the Spanish Ministerio de Ciencia, Innovacion y Universidades and the European ERDF under Grants T-PARCCA (RTI2018-098156-B-C51), and TIN2016-76635-C2-1-R (AEI/ERDF, EU), by the Universitat Politecnica de Valencia under Grant SP20190169, and by the gaZ: T58_17R research group (Aragon Gov. and European ESF).Candel-Margaix, F.; Valero Bresó, A.; Petit Martí, SV.; Sahuquillo Borrás, J. (2019). Efficient Management of Cache Accesses to Boost GPGPU Memory Subsystem Performance. IEEE Transactions on Computers. 68(10):1442-1454. https://doi.org/10.1109/TC.2019.2907591S14421454681

    Mitigación de ataques de canal lateral basados en caracterización térmica y eléctrica

    Get PDF
    Los ataques de canal lateral se han convertido en una vulnerabilidad de creciente importancia en las plataformas multinúcleo y, especialmente, en entornos cloud. Estos ataques aprovechan vulnerabilidades de la implementación física de un sistema en lugar de actuar sobre los algoritmos que ejecutan.Este trabajo se centra en la monitorización térmica, energética y, sobre todo, en la transmisión a través de canales laterales térmicos. La transmisión térmica se basa en que un núcleo puede inducir cambios de temperatura en los de su alrededor, lo cual puede ser utilizado para transmitir información sigilosamente entre contenedores aislados y desplegados en núcleos físicos diferentes.Con el objetivo de una mayor comprensión de estos ataques, así como para diseñar medidas de mitigación efectivas, se lleva a cabo una caracterización experimental y exhaustiva del ataque sobre un servidor multinúcleo real como los empleados por cualquier proveedor cloud. De esta manera, se identifica y se analiza cuantitativamente el impacto de los factores principales que afectan al éxito del ataque, como es el caso de la elección de los núcleos emisores y receptores en la creación del canal, la tasa de transmisión de datos o el nivel de carga de trabajo en el sistema. El ataque se ha completado con éxito total en máquinas reales, con virtualización por contenedores y parcialmente en máquinas con virtualización de sistema.Partiendo de la caracterización anterior, se implementa una prueba de concepto para demostrar su impacto y su efectividad como método de extracción de información sensible. Finalmente, se desarrollan y evalúan módulos del kernel de Linux que mitigan por completo el ataque.<br /

    Caracterización del envejecimiento en los bancos de registros de microprocesadores x86

    Get PDF
    La continua miniaturización del transistor compromete la fiabilidad de los circuitos digitales. Entre los efectos que degradan el transistor a lo largo del tiempo, destacan los efectos Bias Temperature Instability (BTI) y Hot Carrier Injection (HCI). Ambos efectos aumentan el voltaje umbral de los transistores, lo cual puede provocar fallos en la ejecución de las aplicaciones, acortando el tiempo de vida útil de los microprocesadores.Los componentes de un procesador más susceptibles a los efectos BTI y HCI son aquellos que integran un mayor número de transistores, se utilizan continuamente y con frecuencia, y tienen un impacto directo en el rendimiento del sistema. En este trabajo, se caracterizan ambos efectos de envejecimiento sobre el banco de registros de un procesador x86 atendiendo a los valores almacenados en esta estructura de memoria. Para ello, se instrumenta un simulador ciclo-a-ciclo que permite emular el comportamiento de un procesador Skylake y extraer estadísticas de uso relacionadas con la degradación de los transistores a partir de la ejecución de un conjunto de aplicaciones científicas.Los resultados experimentales muestran diferentes patrones de degradación dependiendo de la aplicación ejecutada, aunque se pueden extraer características comunes que contribuyen a acelerar la degradación del banco de registros. Por ejemplo, la mayoría de aplicaciones almacenan una gran cantidad de bytes nulos y direcciones de memoria que afectan a celdas específicas de los registros. A partir de un modelo analítico, se comprueba que los transistores de estas celdas sufren el mayor aumento de voltaje umbral al cabo de un uso continuado durante tres años.El estudio de caracterización presentado en este trabajo permitirá diseñar mecanismos arquitectónicos capaces de mitigar o distribuir el desgaste de los transistores de manera homogénea por todo el banco de registros, alargando sustancialmente el tiempo de vida del procesador.<br /

    Exploiting Reuse Information to Reduce Refresh Energy in On-Chip eDRAM Caches

    Full text link
    © Owner/Author 2013. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in ICS '13 Proceedings of the 27th international ACM conference on International conference on supercomputing; http://dx.doi.org/10.1145/2464996.2467278.This work introduces a novel refresh mechanism that leverages reuse information to decide which blocks should be refreshed in an energy-aware eDRAM last-level cache. Experimental results show that, compared to a conventional eDRAM cache, the energy-aware approach achieves refresh energy savings up to 71%, while the reduction on the overall dynamic energy is by 65% with negligible performance losses.This work was supported by the Spanish Ministerio de Economía y Competitividad (MINECO) and Plan E funds, under Grants TIN-2009-14475-C04-01 and TIN2012-38341-C04-01.Valero Bresó, A.; Sahuquillo Borrás, J.; Petit Martí, SV.; Duato Marín, JF. (2013). Exploiting Reuse Information to Reduce Refresh Energy in On-Chip eDRAM Caches. ACM. https://doi.org/10.1145/2464996.2467278

    An Aging-Aware GPU Register File Design Based on Data Redundancy

    Get PDF
    "© 2019 IEEE. Personal use of this material is permitted. Permissíon from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertisíng or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works."[EN] Nowadays, GPUs sit at the forefront of high-performance computing thanks to their massive computational capabilities. Internally, thousands of functional units, architected to be fed by large register files, fuel such a performance. At deep nanometer technologies, the SRAM memory cells that implement GPU register files are very sensitive to the Negative Bias Temperature Instability (NBTI) effect. NBTI ages cell transistors by degrading their threshold voltage Vth over the lifetime of the GPU. This degradation, which manifests when a cell keeps the same logic value for a relatively long period of time, compromises the cell read stability and increases the transistor switching delay, which can lead to wrong read values and eventually exceed the processor cycle time, respectively, so resulting in faulty operation. Thiswork proposes architectural mechanisms leveraging the redundancy of the data stored in GPU register files to attack NBTI aging. The proposed mechanisms are based on data compression, power gating, and register address rotation techniques. All these mechanismsworking together balance the distribution of logic values stored in the cells along the execution time, reducing both the overall Vth degradation and the increase in the transistor switching delays. Experimental results show that a conventional GPU register file suffers the worst case for NBTI, since a significant fraction of the cells maintain the same logic value during the entire application execution (i.e., a 100 percent '0' and '1' duty cycle distributions). On average, the proposal reduces these distributions by 58 and 68 percent, respectively, which translates into Vth degradation savings by 54 and 62 percent, respectively.This work was supported by the Gobierno de Aragon and the European ESF (gaZ: T58_17R research group), and by the Ministerio de Economia y Competitividad (MINECO) and AEI/FEDER (EU) funds under Grants TIN2016-76635-C2-1-R and TIN2015-66972-C5-1-R.Valero Bresó, A.; Candel-Margaix, F.; Suárez-Gracia, D.; Petit Martí, SV.; Sahuquillo Borrás, J. (2019). An Aging-Aware GPU Register File Design Based on Data Redundancy. IEEE Transactions on Computers. 68(1):4-20. https://doi.org/10.1109/TC.2018.2849376S42068

    Design, performance, and energy consumption of eDRAM/SRAM macrocells for L1 data caches

    Full text link
    (c) 2012 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other users, including reprinting/ republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other worksSRAM and DRAM have been the predominant technologies used to implement memory cells in computer systems, each one having its advantages and shortcomings. SRAM cells are faster and require no refresh since reads are not destructive. In contrast, DRAM cells provide higher density and minimal leakage energy since there are no paths within the cell from Vdd to ground. Recently, DRAM cells have been embedded in logic-based technology (eDRAM), thus overcoming the speed limit of typical DRAM cells. In this paper, we propose a hybrid n-bit macrocell that implements one SRAM cell and n-1 eDRAM cells. This cell is aimed at being used in an n-way set-associative first-level data cache. Architectural mechanisms (e.g., special writeback policies) have been devised to completely avoid refresh logic. Performance, energy, and area have been analyzed in detail. Experimental results show that using typical eDRAM capacitors, and compared to a conventional cache, a 4-way set-associative hybrid cache reduces both energy consumption and area up to 54 and 29 percent, respectively, while having negligible impact on performance (less than 2 percent).This work was supported by the Spanish Ministerio de Ciencia e Innovacion (MICINN), and jointly financed with Plan E funds under Grant TIN2009-14475-C04-01 and by Consolider-Ingenio 2010 under Grant CSD2006-00046.Valero Bresó, A.; Petit Martí, SV.; Sahuquillo Borrás, J.; López Rodríguez, PJ.; Duato Marín, JF. (2012). Design, performance, and energy consumption of eDRAM/SRAM macrocells for L1 data caches. IEEE Transactions on Computers. 61(9):1231-1242. https://doi.org/10.1109/TC.2011.138S1231124261

    Impact on performance and energy of the retention time and processor frequency in L1 macrocell-based data caches

    Full text link
    [EN] Cache memories dissipate an important amount of the energy budget in current microprocessors. This is mainly due to cache cells are typically implemented with six transistors. To tackle this design concern, recent research has focused on the proposal of new cache cells. An n-bit cache cell, namely macrocell, has been proposed in a previous work. This cell combines SRAM and eDRAM technologies with the aim of reducing energy consumption while maintaining the performance. The capacitance of eDRAM cells impacts on energy consumption and performance since these cells lose their state once the retention time expires. On such a case, data must be fetched from a lower level of the memory hierarchy, so negatively impacting on performance and energy consumption. As opposite, if the capacitance is too high, energy would be wasted without bringing performance benefits. This paper identifies the optimal capacitance for a given processor frequency. To this end, the tradeoff between performance and energy consumption of a macrocell-based cache has been evaluated varying the capacitance and frequency. Experimental results show that, compared to a conventional cache, performance losses are lower than 2% and energy savings are up to 55% for a cache with 10 fF capacitors and frequencies higher than 1 GHz. In addition, using trench capacitors, a 4-bit macrocell reduces by 29% the area of four conventional SRAM cells.This work was supported in part by Spanish CICYT under Grant TIN2009-14475-C04-01, by Consolider-Ingenio 2010 under Grant CSD2006-00046, and by European community’s Seventh Framework Programme (FP7/2007-2013) under Grant 289154.Valero Bresó, A.; Sahuquillo Borrás, J.; Lorente Garcés, VJ.; Petit Martí, SV.; López Rodríguez, PJ.; Duato Marín, JF. (2012). Impact on performance and energy of the retention time and processor frequency in L1 macrocell-based data caches. IEEE Transactions on Very Large Scale Integration (VLSI) Systems. 20(6):1108-1117. https://doi.org/10.1109/TVLSI.2011.2142202S1108111720

    Combining RAM technologies for hard-error recovery in L1 data caches working at very-low power modes

    Full text link
    ©2013 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Low-power modes in modern microprocessors rely on low frequencies and low voltages to reduce the energy budget. Nevertheless, manufacturing induced parameter variations can make SRAM cells unreliable producing hard errors at supply voltages below Vccmin. Recent proposals provide a rather low fault-coverage due to the fault coverage/overhead trade-off. We propose a new faulttolerant L1 cache, which combines SRAM and eDRAM cells in L1 data caches to provide 100% SRAM hard-error fault coverage. Results show that, compared to a conventional cache and assuming 50% failure probability at low-power mode, leakage and dynamic energy savings are by 85% and 62%, respectively, with a minimal impact on performance.This work was supported by the Spanish MICINN (TIN2010-18368) with the Consolider-Ingenio 2010 Programme co-funded by the European Commission FEDER funds (CSD2006-00046) and co-funded with the Plan E funds (TIN2009-14475-C04-01). Additionaly, it was supported by Generalitat de Catalunya (2009SGR1250), by FP7 program of the European Commission (TRAMS-248789), and by Spanish MINECO (TIN2012-38341-C04-01).Lorente Garcés, VJ.; Valero Bresó, A.; Sahuquillo Borrás, J.; Petit Martí, SV.; Canal, R.; López Rodríguez, PJ.; Duato Marín, JF. (2013). Combining RAM technologies for hard-error recovery in L1 data caches working at very-low power modes. IEEE, ACM. https://doi.org/10.7873/DATE.2013.031

    Hybrid caches: design and data management

    Full text link
    Cache memories have been usually implemented with Static Random-Access Memory (SRAM) technology since it is the fastest electronic memory technology. However, this technology consumes a high amount of leakage currents, which is a major design concern because leakage energy consumption increases as the transistor size shrinks. Alternative technologies are being considered to reduce this consumption. Among them, embedded Dynamic RAM (eDRAM) technology provides minimal area and leakage by design but reads are destructive and it is not as fast as SRAM. In this thesis, both SRAM and eDRAM technologies are mingled to take the advantatges that each of them o¿ers. First, they are combined at cell level to implement an n-bit macrocell consisting of one SRAM cell and n-1 eDRAM cells. The macrocell is used to build n-way set-associative hybrid ¿rst-level (L1) data caches having one SRAM way and n-1 eDRAM ways. A single SRAM way is enough to achieve good performance given the high data locality of L1 caches. Architectural mechanisms such as way-prediction, swaps, and scrub operations are considered to avoid unnecessary eDRAM reads, to maintain the Most Recently Used (MRU) data in the fast SRAM way, and to completely avoid refresh logic. Experimental results show that, compared to a conventional SRAM cache, leakage and area are largely reduced with a scarce impact on performance. The study of the bene¿ts of hybrid caches has been also carried out in second-level (L2) caches acting as Last-Level Caches (LLCs). In this case, the technologies are combined at bank level and the optimal ratio of SRAM and eDRAM banks that achieves the best trade-o¿ among performance, energy, and area is identi¿ed. Like in L1 caches, the MRU blocks are kept in the SRAM banks and they are accessed ¿rst to avoid unnecessary destructive reads. Nevertheless, refresh logic is not removed since data locality widely di¿ers in this cache level. Experimental results show that a hybrid LLC with an eighth of its banks built with SRAM technology is enough to achieve the best target trade-o¿. This dissertation also deals with performance of replacement policies in heterogeneous LLCs mainly focusing on the energy overhead incurred by refresh operations. In this thesis it is de¿ned a new concept, namely MRU-Tour (MRUT), that helps estimate reuse information of cache blocks. Based on this concept, it is proposed a family of MRUTbased replacement algorithms that randomly select the victim block among those having a single MRUT. These policies are enhanced to leverage recency of information for a few blocks and to adapt to changes in the working set of the benchmarks. Results show that the proposed MRUT policies, with simpler hardware complexity, outperform the Least Recently Used (LRU) policy and a set of the most representative state-of-the-art replacement policies for LLCs. Refresh operations represent an important fraction of the overall dynamic energy consumption of eDRAM LLCs. This fraction increases with the cache capacity, since more blocks have to be refreshed for a given period of time. Prior works have attacked the refresh energy taking into account inter-cell feature variations. Unlike these works, this thesis proposes a selective refresh policy based on the MRUT concept. The devised policy takes into account the number of MRUTs of a block to select whether the block is refreshed. In this way, many refreshes done in a typical distributed refresh policy are skipped (i.e., in those blocks having a single MRUT). This refresh mechanism is applied in the hybrid LLC memory. Results show that refresh energy consumption is largely reduced with respect to a conventional eDRAM cache, while the performance degradation is minimal with respect to a conventional SRAM cache.Valero Bresó, A. (2013). Hybrid caches: design and data management [Tesis doctoral]. Editorial Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/32663TESISPremiad

    Disseny de caches de dades L1 mitjançant tecnologia SRAM i eDRAM

    Full text link
    Valero Bresó, A. (2011). Disseny de caches de dades L1 mitjançant tecnologia SRAM i eDRAM. http://hdl.handle.net/10251/15676Archivo delegad
    corecore